Objectives

  • Produce scatter plots, boxplots, and barplots using ggplot.
  • Set universal plot settings.
  • Describe what faceting is and apply faceting in ggplot.
  • Modify the aesthetics of an existing ggplot plot (including axis labels and colour).
  • Build complex and customized plots from data in a data frame.

Questions

  • What are the components of a ggplot?
  • How do I create scatterplots, lineplots, and barplots?
  • How can I change the aesthetics (ex. colour, transparency) of my plot?
  • How can I create multiple plots at once?

We start by loading the required package. ggplot2 is also included in the tidyverse package.

library(tidyverse)

If not still in the workspace, load the data we saved in the previous lesson.

kings <- read_csv("data_output/kings_plotting.csv", n_max = 54)

Plotting with ggplot2

ggplot2 is a plotting package that makes it simple to create complex plots from data stored in a data frame. It provides a programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 functions work best with data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

Each chart built with ggplot2 must include the following

  • Data

  • Aesthetic mapping (aes)

    • Describes how variables are mapped onto graphical attributes
    • Visual attribute of data including x-y axes, color, fill, shape, and alpha
  • Geometric objects (geom)

    • Determines how values are rendered graphically, as bars (geom_bar), scatterplot (geom_point), line (geom_line), etc.

Thus, the template for graphic in ggplot2 is:

<DATA> %>%
    ggplot(aes(<MAPPINGS>)) +
    <GEOM_FUNCTION>()

Remember from the last lesson that the pipe operator %>% places the result of the previous line(s) into the first argument of the function. ggplot is a function that expects a data frame to be the first argument. This allows for us to change from specifying the data = argument within the ggplot function and instead pipe the data into the function.

  • use the ggplot() function and bind the plot to a specific data frame.
kings %>%
    ggplot()
  • define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
kings %>%
    ggplot(aes(x = Midyear, y = Reign_duration))
  • add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:

    • geom_point() for scatter plots, dot plots, etc.
    • geom_boxplot() for, well, boxplots!
    • geom_line() for trend lines, time series, etc.

To add a geom to the plot use the + operator. Because we have two continuous variables, let’s use geom_point() first:

kings %>% 
  ggplot(aes(x = Midyear, y = Reign_duration)) +
  geom_point()    # basic scatterplot

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot templates and conveniently explore different types of plots, so the above plot can also be generated with code like this, similar to the “intermediate steps” approach in the previous lesson:

ggplot(kings, aes(x = Midyear, y = Reign_duration)) +
  geom_point()+    # basic scatterplot
  geom_smooth()   # visual trend

Note

  • Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
  • You can also specify mappings for a given geom independently of the mapping defined globally in the ggplot() function.
  • The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.
kings_plot <- ggplot(kings, aes(x = Midyear, y = Reign_duration)) +
  geom_point()+    # basic scatterplot
  geom_smooth()+   # visual trend
  labs(title = "How long danish kings ruled over time", 
       x = "Year ", y = "Year they ruled") +   # better title and axes' labels
  theme_bw() +                                 # cleaner look 
  theme(text = element_text(size = 14))        # bigger font to make readable    
## This is the correct syntax for adding layers
kings_plot +
     geom_text(aes(label=Name), size=3) 

## This will not add the new layer and will return an error message
kings_plot
+   geom_text(aes(label=Name), size=3) 

Building your plots iteratively

Building plots with ggplot2 is typically an iterative process. We start by defining the dataset we’ll use, lay out the axes, and choose a geom:

kings %>%
    ggplot(aes(x = Midyear, y = Reign_duration)) +
    geom_point()

Scatter plot of number of items owned versus number of household members.

Then, we start modifying this plot to extract more information from it. For instance, when inspecting the plot we notice that points only appear at the intersection of whole numbers of Midyear and Reign_duration.

To colour each village in the plot differently, you could use a vector as an input to the argument color. However, because we are now mapping features of the data to a colour, instead of setting one colour for all points, the colour of the points now needs to be set inside a call to the aes function. When we map a variable in our data to the colour of the points, ggplot2 will provide a different colour corresponding to the different values of the variable. We will continue to specify the value of alpha, width, and height outside of the aes function because we are using the same value for every point. ggplot2 understands both the Commonwealth English and American English spellings for colour, i.e., you can use either color or colour. Here is an example where we color points by the village of the observation:

kings %>%
    ggplot(aes(x = Midyear, y = Reign_duration)) +
    geom_point(aes(color = House), width = 0.2, height = 0.2)

There appears to be an increasing trend in reign duration over time, but is rule expectancy growing evenly with time or are houses substantially different?

Boxplot

As you will learn, there are multiple ways to plot the a relationship between variables. A boxplot can be used to plot a distribution of datapoints withing a group, in our case individual reigns within a royal house.

kings %>%
   ggplot(aes(x = House, y = Reign_duration, color = House)) +
   geom_boxplot() +
   theme_bw()

Boxplot by house. By adding points to a boxplot, we can have a better idea of the number of measurements and of their distribution:

kings %>%
   ggplot(aes(x = House, y = Reign_duration, color = House)) +
   geom_boxplot() +
   theme_bw()+
   geom_jitter(alpha = 0.3,
            color = "black",
            width = 0.2,
            height = 0.2)

kings %>%
   ggplot(aes(x = House, y = Reign_duration, color = House)) +
   geom_violin() +
   theme_bw()+
   labs(x = "",
        y = "Years on the throne")+
   scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
   theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45,
                                     hjust = 0.5, vjust = 0.5)) 

Exercise 1

Use what you just learned to create a scatter plot of midpoint of rulers' reign by duration of length with the house showing in different colours. Does this seem like a good way to display the relationship between these variables? What other kinds of plots might you use to show this type of data?

kings %>%
    ggplot(aes(x = Midyear, y = Reign_duration)) +
    geom_jitter(aes(color = House),
      #  alpha = 0.3,
               height = 0.2)
Scatter plot of royal houses by reign of duration.

This is not a great way to show this type of data because it is difficult to distinguish any trend in such a wide spread of dots. What other plot types could help you visualize this relationship better?

::::::::::::::::::::::::::::::::::::::::::::::::::

ggplot2 themes

In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme.

The ggthemes package provides a wide variety of options (including an Excel 2003 theme). The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.

Exercise 2

Experiment with at least two different themes. Build the previous plot using each of those themes. Which do you like best?

Customization

Take a look at the ggplot2 cheat sheet, and think of ways you could improve the original smoothed kings_plot.

Now, let’s look at different themes and make sure you have changed names of axes to something more informative than ‘Midyear’ and ‘Reign_duration’ and add a title to the figure:

kings %>%
    ggplot(aes(x = Midyear, y = Reign_duration)) +
    geom_jitter(aes(color = House),
      #  alpha = 0.3,
               height = 0.2)+
    theme_classic()  # try also theme_minimal, and others

Exercise 1

With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet for inspiration. Here are some ideas:

After creating your plot, you can save it to a file in your favourite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters.

Instead, use the ggsave() function, which allows you to easily change the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi).

Make sure you have the fig_output/ folder in your working directory.

my_plot <- 

ggsave("fig_output/name_of_file.png", my_plot, width = 15, height = 10)

Note: The parameters width and height also determine the font size in the saved plot.

Keypoints

  • ggplot2 is a flexible and useful tool for creating plots in R.
  • The data set and coordinate system can be defined using the ggplot function.
  • Additional layers, including geoms, are added using the + operator.
  • Boxplots are useful for visualizing the distribution of a continuous variable.
  • Barplots are useful for visualizing categorical data.
  • Faceting allows you to generate multiple plots based on a categorical variable.